Improving Multi-Scale Aggregation Using Feature Pyramid Module for Robust Speaker Verification of Variable-Duration Utterances
Currently, the most widely used approach for speaker verification is deep
speaker embedding learning. In this approach, we obtain a speaker embedding
vector by pooling single-scale features that are extracted from the last layer
of a speaker feature extractor. Multi-scale aggregation (MSA), which utilizes
multi-scale features from different layers of the feature extractor, has
recently been introduced and shows superior performance for variable-duration
utterances. To increase robustness when dealing with utterances of arbitrary
duration, this paper improves MSA by using a feature pyramid module. The
module enhances speaker-discriminative information of features from multiple
layers via a top-down pathway and lateral connections. We extract speaker
embeddings using the enhanced features that contain rich speaker information
with different time scales. Experiments on the VoxCeleb dataset show that the
proposed module improves previous MSA methods with a smaller number of
parameters. It also achieves better performance than state-of-the-art
approaches for both short and long utterances.
Comment: Accepted to Interspeech 202
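The top-down pathway with lateral connections described in the abstract can be sketched roughly as below. This is a minimal illustrative sketch, not the paper's exact architecture: the random lateral projections, nearest-neighbour upsampling, and mean-pool-then-concatenate aggregation are all assumptions made for the example.

```python
import numpy as np

def feature_pyramid_aggregate(features, d_model, seed=0):
    """Hypothetical feature-pyramid-style enhancement of multi-scale features.

    features: list of (time, channels) arrays, ordered shallow -> deep,
              where deeper features have coarser time resolution.
    Each level gets a lateral projection to d_model channels and receives
    an upsampled top-down signal from the level above it.
    """
    rng = np.random.default_rng(seed)
    # Lateral 1x1-style projections mapping each level to d_model channels.
    laterals = [rng.standard_normal((f.shape[1], d_model)) * 0.01 for f in features]

    # Start the top-down pathway from the deepest (coarsest) level.
    enhanced = [features[-1] @ laterals[-1]]
    for f, w in zip(reversed(features[:-1]), reversed(laterals[:-1])):
        top = enhanced[-1]
        # Nearest-neighbour upsample the top-down signal to this level's length.
        idx = np.linspace(0, top.shape[0] - 1, f.shape[0]).round().astype(int)
        enhanced.append(f @ w + top[idx])  # lateral connection + top-down signal
    enhanced.reverse()

    # Aggregate: mean-pool each enhanced level over time, then concatenate
    # into a single fixed-dimensional speaker embedding.
    return np.concatenate([e.mean(axis=0) for e in enhanced])

# Example: three levels with decreasing time resolution and growing channels.
feats = [np.ones((80, 64)), np.ones((40, 128)), np.ones((20, 256))]
emb = feature_pyramid_aggregate(feats, d_model=128)
print(emb.shape)  # (384,) = 3 levels x 128 channels
```

In a real system the projections would be learned convolutions inside the speaker feature extractor, and the pooling would typically be a statistics or attentive pooling layer rather than a plain mean.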
Additional Shared Decoder on Siamese Multi-view Encoders for Learning Acoustic Word Embeddings
Acoustic word embeddings --- fixed-dimensional vector representations of
arbitrary-length words --- have attracted increasing interest in
query-by-example spoken term detection. Recently, based on the observation that
the orthography of text labels partly reflects the phonetic similarity between
words' pronunciations, a multi-view approach has been introduced that jointly
learns acoustic and text embeddings. It showed that discriminative embeddings
can be learned by designing an objective that takes both text labels and word
segments as input. In this paper, we propose a network architecture that
expands the multi-view approach by combining the Siamese multi-view encoders
with a shared decoder network to maximize the effect of the relationship
between acoustic and text embeddings in embedding space. Discriminatively
trained with a multi-view triplet loss and a decoding loss, our proposed
approach achieves better performance on the acoustic word discrimination task
with the WSJ dataset, yielding an 11.1% relative improvement in average
precision. We also
present experimental results on cross-view word discrimination and word-level
speech recognition tasks.
Comment: Accepted at the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019)
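A cross-view triplet objective of the kind the abstract mentions can be sketched as follows. This is a hedged illustration of one common formulation (cosine similarities with hardest in-batch negatives), not necessarily the authors' exact loss; the margin value and negative-sampling scheme are assumptions, and the shared-decoder reconstruction loss is omitted.

```python
import numpy as np

def multiview_triplet_loss(acoustic, text, margin=0.4):
    """Cross-view triplet loss between paired acoustic and text embeddings.

    acoustic, text: (batch, dim) L2-normalised embeddings, where row i of
    each view corresponds to the same word. Each matched pair (diagonal)
    should be more similar than the hardest mismatched pair by `margin`.
    """
    sim = acoustic @ text.T                 # cross-view cosine similarities
    pos = np.diag(sim)                      # matched acoustic/text pairs
    mask = np.eye(len(sim), dtype=bool)
    # Hardest cross-view negatives: best-scoring off-diagonal entries.
    neg_a = np.where(mask, -np.inf, sim).max(axis=1)  # acoustic anchor vs wrong text
    neg_t = np.where(mask, -np.inf, sim).max(axis=0)  # text anchor vs wrong acoustic
    # Hinge in both directions (the two "views" of the triplet objective).
    loss = np.maximum(0.0, margin + neg_a - pos) + np.maximum(0.0, margin + neg_t - pos)
    return loss.mean()

# With orthonormal, perfectly aligned views the hinge is inactive:
a = np.eye(4, 16)                 # four orthogonal unit-norm embeddings
print(multiview_triplet_loss(a, a.copy()))  # 0.0
```

In the proposed architecture this term would be combined with a decoding loss from the shared decoder, encouraging both views to carry enough information to reconstruct the target sequence.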
Does Technological Innovation Really Reduce Marginal Abatement Costs? Some Theory, Algebraic Evidence, and Policy Implications
Keywords: Marginal abatement costs, Production process innovations, Technological change; JEL codes: O38, Q28